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Abstract 

Variables in many massive high-dimensional data sets are structured, arising for 
example from measurements on a regular grid as in imaging and time series or from 
spatial-temporal measurements as in climate studies. Classical multivariate techniques 
ignore these structural relationships often resulting in poor performance. We propose 
a generalization of the singular value decomposition (SVD) and principal components 
analysis (PCA) that is appropriate for massive data sets with structured variables or 
known two-way dependencies. By finding the best low rank approximation of the data 
with respect to a transposable quadratic norm, our decomposition, entitled the Gen- 
eralized least squares Matrix Decomposition (GMD), directly accounts for structural 
relationships. As many variables in high-dimensional settings are often irrelevant or 
noisy, we also regularize our matrix decomposition by adding two-way penalties to en- 
courage sparsity or smoothness. We develop fast computational algorithms using our 
methods to perform generalized PCA (GPCA), sparse GPCA, and functional GPCA 
on massive data sets. Through simulations and a whole brain functional MRI example 
we demonstrate the utility of our methodology for dimension reduction, signal recovery, 
and feature selection with high-dimensional structured data. 

Keywords: matrix decomposition, singular value decomposition, principal compo- 
nents analysis, sparse PCA, functional PCA, structured data, neuroimaging. 



1 Introduction 



Principal components analysis (PCA), and the singular value decomposition (SVD) upon 
which it is based, form the foundation of much of classical multivariate analysis. It is well 
known, however, that with both high-dimensional data and functional data, traditional 



PCA can perform poorly (Silverman 1996 JoUiffe et al. 2003 Johnstone and Lu 2004) 



Methods enforcing sparsity or smoothness on the matrix factors, such as sparse or functional 



PCA, have been shown to consistently recover the factors in these settings (Silverman 1996 



Johnstone and Lu 2004 1 . Recently, these techniques have been extended to regularize both 
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the row and column factors of a matrix (Huang et al. 2009 Witten et al. 2009 Lee et al 



20101. All of these methods, however, can fail to capture relevant aspects of structured high- 



dimensional data. In this paper, we propose a general and flexible framework for PCA that 
can directly account for structural dependencies and incorporate two-way regularization for 
exploratory analysis of massive structured data sets. 

Examples of high-dimensional structured data in which classical PCA performs poorly 
abound in areas of biomedical imaging, environmental studies, time series and longitudinal 
studies, and network data. Consider, for example, functional MRIs (fMRIs) which measure 
three-dimensional images of the brain over time with high spatial resolution. Each three- 
dimensional pixel, or voxel, in the image corresponds to a measure of the bold oxygenation 
level dependent (BOLD) response (hereafter referred to as "activation"), an indirect measure 
of neuronal activity at a particular location in the brain. Often these voxels are vectorized 
at each time point to form a high-dimensional matrix of voxels (« 10,000 - 100,000) by time 
points (« 100 - 10,000) to be analyzed by multivariate methods. Thus, this data exhibits 
strong spatial dependencies among the rows and strong temporal dependencies among the 
columns ( |Lindquist{ 2008[ Lazar[ [2008 ) . Multivariate analysis techniques are often applied 
to fMRI data to find regions of interest, or spatially contiguous groups of voxels exhibiting 
strong co-activation as well as the time courses associated with these regions of activity 
(Lindquist 2008). Principal components analysis and the SVD, however, are rarely used 



for this purpose. Many have noted that the first several principal components of fMRI data 
appear to capture spatial and temporal dependencies in the noise rather than the patterns of 
brain activation in which neuroscientists are interested. Furthermore, subsequent principal 
components often exhibit a mixture of noise dependencies and brain activation such that 



the signal in the data remains unclear (Friston et al. 1999 Calhoun et al. 2001 Thirion 



and Faugeras 2003 Viviani et al. 2005). This behavior of classical PCA is typical when it 



is applied to structured data with low signal compared to the noise dependencies induced 
by the data structure. 

To understand the poor performance of the SVD and PCA is these settings, we ex- 
amine the model mathematically. We observe data Y S W^^p, for which standard PCA 
considers the following model: Y = M -t- U D V"^ + E, where M denotes a mean matrix, 
D the singular values, U and V the left and right factors, respectively, and E the noise 
matrix. Implicitly, SVD models assume that the elements of E are independently and 
identically distributed, as the SVD is calculated by minimizing the Frobenius norm loss 
function: ||X — UDV^|||, — J27=iJ2^=ii^ij ~ u^DvJ)^, where is the i*'^ column 
of U and Vj is analogous and X denotes the centered data, X = Y — M. This sums 
of squares loss function weights errors associated with each matrix element equally and 
cross-product errors, between elements Xij and Xi>ji, for example, are ignored. It comes 
as no surprise then that the Frobenius norm loss and thus the SVD perform poorly with 
data exhibiting strong structural dependencies among the matrix elements. To permit un- 
equal weighting of the matrix errors according to the data structure, we assume that the 
noise is structured: vec(E) ~ (0, (g) Q~^), or the covariance of the noise is separable 
and has a Kronecker product covariance ([Gupta and Nagar 1999). Our loss function. 



then changes from the Frobenius norm to a transposable quadratic norm that permits 
unequal weighting of the matrix error terms based on Q and R: || X — U D V"^ ||q — 

ELi EU S/=i Q»' ^n' - u, D vj) (X,,,, - u,, D vp . By finding the ' best 

low rank approximation of the data with respect to this transposable quadratic norm, we 
develop a decomposition that directly accounts for two-way structural dependencies in the 
data. This gives us a method. Generalized PCA, that extends PCA for applications to 
structured data. 

While this paper was initially under review, previous work on unequal weighting of ma- 
trix elements in PCA via a row and column generalizing operators came to our attention 
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(Escoufier 19771. This so called, "duality approach to PCA" is known in the French mul- 



tivariate community, but has perhaps not gained the popularity in the statistics literature 
that it deserves. Developed predominately in the 1970's and 80's for applications in eco- 



logical statistics, few texts on this were published in English (Escoufier 1977 Tenenhaus 



and Young 1985 Escoufier 1987). Recently, a few works review these methods in relation 



to classical multivariate statistics (Escoufier 2006 Purdom 2007 Dray and Dufour 2007 



Holmes 2008 ) . While these works propose a mathematical framework for unequal weighting 



matrices in PCA, much more statistical and computational development is needed in order 
to directly apply these methods to high-dimensional structured data. 

In this paper, we aim to develop a flexible mathematical framework for PCA that ac- 
counts for known structure and incorporates regularization in a manner that is computation- 
ally feasible for analysis of massive structured data sets. Our specific contributions include 
(1) a matrix decomposition that accounts for known two-way structure, (2) a framework 
for two-way regularization in PCA and in Generalized PCA, and (3) results and algorithms 
allowing one to compute (1) and (2) for ultra high-dimensional data. Specifically, beyond 
the previous work in this area reviewed by Holmes (2008), we provide results allowing for (i) 
general weighting matrices, (ii) computational approaches for high-dimensional data, and 
(iii) a framework for two-way regularization in the context of Generalized PCA. As our 
methods are flexible and general, they will enjoy wide applicability to a variety of struc- 
tured data sets common in medical imaging, remote sensing, engineering, environmetrics, 
and networking. 

The paper is organized as follows. In Sections 2.1 through 2.5, we develop the math- 
ematical framework for our Generalized Least Squares Matrix Decomposition and resulting 
Generalized PCA. In these sections, we assume that the weighting matrices, or quadratic 
operators are fixed and known. Then, by discussing interpretations of our approach and 
relations to existing methodology, we present classes of quadratic operators in Section 2.6 
that are appropriate for many structured data sets. As the focus of this paper is the statis- 
tical methodology for PCA with high-dimensional structured data, our intent here is to give 
the reader intuition on the quadratic operators and leave other aspects for future work. In 
Section |3) we propose a general framework for incorporating two-way regularization in PCA 
and Generalized PCA. We demonstrate how this framework leads to methods for Sparse and 
Functional Generalized PCA. In Section |4j we give results on both simulated data and real 
functional MRI data. We conclude, in Section [5j with a discussion of the assumptions, 
implications, and possible extensions of our work. 



2 A Generalized Least Squares Matrix Decomposition 

We present a generalization of the singular value decomposition (SVD) that incorporates 
known dependencies or structural information about the data. Our methods can be used to 
perform Generalized PCA for massive structured data sets. 

2.1 GMD Problem 

Here and for the remainder of the paper, we will assume that the data, X has previously 
been centered. We begin by reviewing the SVD which can be written as X = U D V"^ with 
the data X e 3?"^p, and where U e 3fi"^P and V e 3?p^p are orthonormal and D e 
is diagonal with non-negative diagonal elements. Suppose one wishes to find the best low 
rank, K < mm{n,p), approximation to the data. Recall that the SVD gives the best rank- if 
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approximation with respect to the Frobenius norm (Eckart and Young 1936): 
minimize ||X-UDV'^||l 

U:rank(U)=if,D:DeX',V:rank(V)=A' 

subject to U^U = I(A'), V'^V = I(K) & diag(D) > 0. (1) 

Here, V is the class of diagonal matrices. The solution is given by the first K singular 
vectors and singular values of X. 

The Frobenius norm weights all errors equally in the loss function. We seek a general- 
ization of the SVD that allows for unequal weighting of the errors to account for structure 
or known dependencies in the data. To this end, we introduce a loss function given by a 
transposable quadratic norm, the Q,R-norm, defined as follows: 

Definition 1 Let Q e 5R"^" and R € he positive semi-definite matrices, Q,R ^ 0. 



Then the Q,R-norm (or semi-norm) is defined as ||X||q^r = Ytr(QXRX ). 

We note that if Q and R are both positive definite, Q,R ;^ 0, then the Q,R-norm is a 
proper matrix norm. If Q or R are positive semi-definite, then it is a semi-norm, meaning 
that for X ^ 0, || X ||q,r = if X e null{Q,) or if X^ G null{R). We call the Q, R-norm a 
transposable quadratic norm, as it right and left multiplies X and is thus an extension of the 



quadratic norm, or "norm ind uced by an inner product space" ( Boyd and Vandenberghe 

1985|. Note that ||X||q ^ ||X||qj and ||X^ ||r ^ ||X^ ||rj are 



2004 



Horn and Johnson 



then quadratic norms. 

The GMD is then taken to be the best rank- A' approximation to the data with respect 
to the Q, R-norm: 

minimize 1 1 X - U D 1 1?, „ 

U:iank(U)=i<',D:DeX',V:rank(V)=if ^' 

subject to U^QU = I(K), V^RV = I(;^) & diag(D) > 0. (2) 

So we do not confuse the elements of the GMD with that of the SVD, we call U and V 
the left and right GMD factors respectively, and the diagonal elements of D, the GMD 
values. The matrices Q and R are called the left and right quadratic operators of the GMD 
respectively. Notice that the left and right GMD factors are constrained to be orthogonal 
in the inner product space induced by the Q-norm and R-norm respectively. Since the 
Frobenius norm is a special case of the Q, R-norm, the GMD is in fact a generalization of 
the SVD. 

We pause to motivate the rationale for finding a matrix decomposition with respect to 
the Q, R-norm. Consider a matrix-variate Gaussian matrix, X '-^ iV„.p(U D V^, Q^^, R^^), 
or vec(X) ^ N{yec{\J D V^), R~^ <Si Q~^) in terms of the familiar multivariate normal. The 
normal log-likelihood can be written as: 

^(X|Q"\R"^) cxtr (q(X-UDV^)R(X-UDV^)^) = || X - U D | |^.r. 

Thus, as the Frobenius norm loss of the SVD is proportional to the log-likclihood of the 
spherical Gaussian with mean UDV^, the Q,R -norm loss is proportional to the log- 
likelihood of the matrix-variate Gaussian with mean UDV"^, row covariance Q~^, and 
column covariance R~^. In general, using the Q, R-norm loss assumes that the covariance 
of the data arises from the Kronecker product between the row and column covariances, 
or that the covariance structure is separable. Several statistical tests exist to check these 
assumptions with real data (Mitchell et al. 2006 Li et al. 20081. Hence, if the dependence 
structure of the data is known, taking the quadratic operators to be the inverse covariance 
or precision matrices directly accounts for two-way dependencies in the data. We assume 
that Q and R are known in the development of the GMD and discuss possible choices for 
these with structured data in Section \T5\ 
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2.2 GMD Solution 

The GMD solution, X — U* D*(V*)-^, is comprised of U*,D* and V*, the optimal points 
minimizing the GMD problem ([2]). The following result states that the GMD solution is an 
SVD of an altered data matrix. 

Theorem 1 Suppose rank(Q) = I and rank(R) = m. Decompose Q and R by letting 
Q = QQ and R = RR where Q G 3?"^' and R G are of full column rank. 

Define X = Q^XR and let X = UDV^ be the SVD of ±. Then, the GMD solution, 
X = U*D*(V*)^, IS given by the GMD factors U* = Q^^U and V* = R^^V and the 
GMD values, D* = D. Here, Q and R are any left matrix inverse: (Q = 
and (R )-^R = !(,„). 

We make some brief comments regarding this result. First , the decomposition, Q = Q Q 
where Q G 3?"^' is of full column rank, exists since Q ^ (Horn and Johnson 19851; the 



decomposition for R exists similarly. A possible form of this decomposition, the resulting 
X, and the GMD solution X can be obtained from the eigenvalue decomposition of Q and 
R: Q = Tq Aq Tq and R = Tr, Ar T^. If Q is fuU rank, then we can take Q = Q^/^ = 

rqAQ^^rq, giving the GMD factor U* = Tq Aq^^^ Tq tj, and similarly for R and V. 

~ 1/2 

On the other hand, if Q 0, then a possible value for Q is rQ(-, 1 : I) Aq (1 : Z, 1 : /), 
giving an n X I matrix with full column rank. The GMD factor, U* is then given by 

rQ(.,l:OAqi/'(l:Z,l:OU. 

From Theorem [TJ we see that the GMD solution can be obtained from the SVD of 
X. Let us assume for a moment that X is matrix-variate Gaussian with row and column 
covariance and R~^ respectively. Then, the GMD is like taking the SVD of the 

sphered data, as right and left multiplying by Q and R yields data with identity covariance. 
In other words, the GMD de-correlates the data so that the SVD with equally weighted 
errors is appropriate. While the GMD values are the singular values of this sphered data, 
the covariance is multiplied back into the GMD factors. 

This relationship to the matrix-variate normal also begs the question, if one has data with 
row and column correlations, why not take the SVD of the two-way sphered or "whitened" 
data? This is inadvisable for several reasons. First, pre-whitening the data and then taking 
the SVD yields matrix factors that are in the wrong basis and are thus uninterpretable in 
the original data space. Given this, one may advocate pre-whitening, taking the SVD and 
then re-whitening, or in other words multiplying the correlation back in to the SVD factors. 
This approach, however, is still problematic. In the special case where Q and R are of 
full rank, the GMD solution is exactly to same as this pre and re-whitening approach. In 
the general case where Q and R are positive semi-definite, however, whitening cannot be 
directly performed as the covariances are not full rank. In the following section, we will show 
that the GMD solution can be computed without taking any matrix inverses, square roots 
or eigendecompositions and is thus computationally more attractive than naive whitening. 
Finally, as our eventual goal is to develop a framework for both structured data and two- 
way regularization, naive pre-whitening and then re-whitening of the data would destroy 
any estimated sparsity or smoothness from regularization methods and is thus undesirable. 
Therefore, the GMD solution given in Theorem [T] is the mathematically correct way to 
"whiten" the data in the context of the SVD and is superior to a naive whitening approach. 

Next, we explore some of the mathematical properties of our GMD solution, X — 
U* D*(V*)^ in the following corollaries: 

Corollary 1 The GMD solution is the global minimizer of the GMD problem, ([2|. 
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Corollary 2 Assume that range{Cl) D null(X.) = and range(R) D nullCX.) = 0. Then, 
there exists at most k = min{rank(X), rank(Q), rank(R)} non-zero GMD values and corre- 
sponding left and right GMD factors. 

Corollary 3 With the assumptions and k defined as in Gorollary the rank-k GMD 
solution has zero reconstruction error in the Q, 'R.-norm. If in addition, k = rank(X) and 
Q and R are full rank, then X = U*D*(V*)'^, that is, the GMD is an exact matrix 
decomposition. 

Corollary 4 The GMD values, D* , are unique up to multiplicity. If in addition, Q and R 
are full rank and the non-zero GMD values are distinct, then the GMD factors U* and V* 
corresponding to the non-zero GMD values are essentially unique (up to a sign change) and 
the GMD factorization is unique. 

Some further comments on these results are warranted. In particular, Theorem [T] is 
straightforward and less interesting when Q and R are full rank, the case discussed in 



Escoufier 


2006 


Purdom 


2007 Holmes 


2008 



. When the quadratic operators are positive 



semi-definite, however, the fact that a global minimizer to the GMD problem, which is 
non-convex, that has a closed form can be obtained is not immediately clear. The result 
stems from the relation of the GMD to the SVD and the fact that the latter is a unique 
matrix decomposition in a lower dimensional subspace defined in Corollary[2j Also note that 
when Q and R are full rank the GMD is an exact matrix decomposition; in the alternative 
scenario, we do not recover X exactly, but obtain zero reconstruction error in the Q, R-norm. 
Finally, we note that when Q and R are rank deficient, there are possibly many optimal 
points for the GMD factors, although the resulting GMD solution is still a global minimizer 
of the GMD problem. In Section 2.6, we will see that permitting the quadratic operators to 
be positive semi-definite allows for much greater flexibility when modeling high-dimensional 
structured data. 



2.3 GMD Algorithm 

While the GMD solution given in Theorem [T] is conceptually simple, its computation, 
based on computing Q, R and the SVD of X, is infeasible for massive data sets common 
in neuroimaging. We seek a method of obtaining the GMD solution that avoids computing 
and storing Q and R and thus permits our methods to be used with massive structured 
data sets. 

Proposition 1 If u and v are initialized such that Q u* 7^ and v'^ Q v* 7^ 0, then 
Algorithm [7] converges to the GMD solution given in Theorem [7} the global minimizer of 
the GMD problem. If in addition, Q and R are full rank, then it converges to the unique 
global solution. 

In Algorithm [T] we give a method of computing the components of the GMD solu 



tion that is a variation of the familiar power method for calculating the SVD (Golub and 
Van Loan 1996[ ). Proposition[I]states that the GMD Algorithm based on this power method 



converges to the GMD solution. Notice that we do not need to calculate Q and R and thus, 
this algorithm is less computationally intensive for high-dimensional data sets than finding 
the solution via Theorem [l] or the computational approaches given in Escoufier ( 1987 ) for 
positive definite operators. 

At this point, we pause to discuss the name of our matrix decomposition. Recall that the 
power method, or alternating least squares method, sequentially estimates u with v fixed 
and vice versa by solving least squares problems and then re-scaling the estimates. Each 
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Algorithm 1 GMD Algorithm (Power Method) 



1. Let = X and initialize Ui and vi. 

2. For k^l...K: 

(a) Repeat until convergence: 

• Set Ufe = "-^^ ■ 

• Set Vfc = — \ (fc) ^ ^ — . 

(b) Set dk = u^QX^'^-'Rvfe. 

(c) Set X = X — Ufc d 

3. Return U* = [ui, . . . , uk], V* ~ [vi, . . . v^] and D* = diag(di 



step of the GMD algorithm, then, estimates the factors by solving the following generalized 
least squares problems and re-scaling: 1 1 X — (d' u') ||q and 1 1 X^ — (d' v') ||^. This is 
then the inspiration for the name of our matrix decomposition. 

2.4 Generalized Principal Components 

In this section we show that the GMD can be used to perform Generalized Principal Com- 



ponents Analysis (GPCA). Note that this result was first shown in Escoufier (1977) for 
positive definite generalizing operators, but we review it here for completeness. The results 
in the previous section allow one to compute these GPCs for high-dimensional data when 
the quadratic operators may not be full rank and we wish to avoid computing SVDs and 
eigenvalue decompositions. 

Recall that the SVD problem can be written as finding the linear combination of variables 
maximizing the sample variance such that these linear combinations are orthonormal. For 
GPCA, we seek to project the data onto the linear combination of variables such that the 
sample variance is maximized in a space that accounts for the structure or dependencies in 
the data. By transforming all inner product spaces to those induced by the Q,R-norm, we 
arrive at the following Generalized PGA problem: 

maximize VfcRX"^QXRvfc subject to VfcRvfc = l& VfcRvfc'=0 V k' < k. (3) 

Vfc 

Notice that the loadings of the GPCs are orthogonal in the R-norm. The /c*'' GPC is then 
given by Zfc = XRvfc. 

Proposition 2 The solution to the k*^ Generalized principal component problem, ([3|, is 
given by the k*-^ right GMD factor. 



Corollary 5 The proportion of variance explained by the fc*'' Generalized principal compo- 

|2 

Iq,r- 



nent is given 6y X 



Just as the SVD can be used to find the principal components, the GMD can be used 
to find the generalized principal components. Hence, GPCA can be performed using the 
GMD algorithm which does not require calculating Q and R. This allows one to efficiently 
perform GPCA for massive structured data sets. 
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2.5 Interpretations, Connections &z Quadratic Operators 



In the previous sections, we have introduced the methodology for the Generahzed Least 
Squares Matrix Decomposition and have outhned how this can be used to compute GPCs 
for high-dimensional data. Here, we pause to discuss some interpretations of our methods 
and how they relate to existing methodology for non-linear dimension reduction. These 
interpretations and relationships will help us understand the role of the quadratic operators 
and the classes of quadratic operators that may be appropriate for types of structured high- 
dimensional data. As the focus of this paper is the statistical methodology of PCA for 
high-dimensional structured data, we leave much of the study of quadratic operators for 
specific applications as future work. 

First, there are many connections and interpretations of the GMD in the realm of classical 
matrix analysis and multivariate analysis. We will refrain from listing all of these here. 



but note that there are close relations to the GSVD of Van Loan ( 1976 ) and generalized 



eigenvalue problems (Golub and Van Loan 1996) as well as statistical methods such as 
discriminant analysis, canonical correlations analysis, and correspondence analysis. Many 
of these connections are discussed in Holmes ( 2008 ) and Purdom ( 2007 ) . Recall also that the 



GMD can be interpreted as a maximum likelihood problem for the matrix-variate normal 
with row and column inverse covariances fixed and known. This connection yields two 
interpretations worth noting. First, the GMD is an extension of the SVD to allow for 
heteroscedastic (and separable) row and column errors. Thus, the quadratic operators act 
as weights in the matrix decomposition to permit non i.i.d. errors. The second relationship is 
to whitening the data with known row and column covariances prior to dimension reduction. 
As discussed in Section 2.2 our methodology offers the proper mathematical context for 



this whitening with many advantages over a naive whitening approach. 

Next, the GMD can be interpreted as decomposing a covariance operator, 
assume that the data follows a simple model: X — S -I- E, where S = 
with the factors v/j and fixed but unknown and with the amplitude (jjk random, and 
E ^ Nn^p{0, Q~^,R~^). Then the (vectorized) covariance of the data can be written as: 



Let us 
■fe Ufe vl 



K 



Cov(vec(X)) = > Var(^fc)vfeV^ 



A;=l 



R 



Q 



such that V RV = I and U QU = I. In other words, the GMD decomposes the covari- 
ance of the data into a signal component and a noise component such that the eigenvectors 
of the signal component are orthonormal to those of the noise component. This is an impor- 
tant interpretation to consider when selecting quadratic operators for particular structured 
data sets, discussed subsequently. 

Finally, there is a close connection between the GMD and smoothing dimension reduction 
methods. Notice that from Theorem [T| the GMD factors U and V will be as smooth as 
as the smallest eigenvectors corresponding to non-zero eigenvalues of Q and R respectively. 
Thus, if Q and R are taken as smoothing operators, then the GMD can yield smoothed 
factors. 

Given these interpretations of the the GMD, we consider classes of quadratic operators 
that may be appropriate when applying our methods to structured data: 

1. Model-based and parametric operators. The quadratic operators could be taken as the 
inverse covariance matrices implied by common models employed with structured data. 
These include well-known time series autoregressive and moving average processes, 
random field models used for spatial data, and Gaussian Markov random fields. 

2. Graphical operators. As many types of structured data can be well represented by 
a graphical model, the graph Laplacian operator, defined as the difference between 
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the degree and adjacency matrix, can be used as a quadratic operator. Consider, for 
example, image data sampled on a regular mesh grid; a lattice graph connecting all 
the nearest neighbors well represents the structure of this data. 

3. Smoothing / Structural embedding operators. Taking quadratic operators as smooth- 
ing matrices common in functional data analysis will yield smoothed GMD factors. In 
these smoothing matrices, two variables are typically weighted inversely proportional 
to the distance between them. Thus, these can also be thought of as structural em- 
bedding matrices, increasing the weights in the loss function between variables that 
are close together on the structural manifold. 

While this list of potential classes of quadratic operators is most certainly incomplete, 
these provide the flexibility to model many types of high-dimcnsional structured data. No- 
tice that many examples of quadratic operators in each of these classes are closely related. 
Consider, for example, regularly spaced ordered variables as often seen in time series. A 
graph Laplacian of a nearest neighbor graph is tri-diagonal, is nearly equivalent except for 
boundary points to the squared second differences matrix often used for inverse smoothing 
in functional data analysis (Ramsay 2006), and has the same zero pattern as the inverse 



covariance of an autoregressive and moving average order one process (Shaman, 1969 GaL 



braith and Galbraith 1974). Additionally, the zero patterns in the inverse covariance matrix 



of Gaussian Markov random fields are closely connected to Markov networks and undirected 



graph Laplacians (Rue and Held 2005). A recent paper, Lindgren et al. (2011), shows how 



common stationary random field models specified by their covariance function such as the 
Matern class can be derived from functions of a graph Laplacian. Finally, notice that graph- 
ical operators such as Laplacians and smoothing matrices are typically not positive definite. 



Thus, our framework allowing for general quadratic operators in Section 2.2 opens more 



possibilities than diagonal weighting matrices (Dray and Dufour 2007) and those used in 



ecological applications (Escoufier 1977 Tenenhaus and Young 1985). 



There are many further connections of the GMD when employed with specific quadratic 
operators belonging to the above classes and other recent methods for non-linear dimension 
reduction. Let us first consider the relationship of the GMD to Functional PCA (FPCA) 



and the more recent two-way FPCA. Silverman ( 1996 ) first showed that for discretized 
functional data, FPCA can be obtained by half-smoothing; Huang et al. ( 2009[ ) elegantly 
extended this to two-way half- smoothing. Let us consider the latter where row and column 
smoothing matrices, S„ = (I -I- A„ flu)~^ and S.y ~ (I + At, Ht,)"^ respectively, are formed 
with ft being a penalty matrix such as to give the squared second differences for example 
(Ramsay 2006) related to the structure of the row and column variables. Then, Huang 



et al. (2009) showed that two-way FPCA can be performed by two-way half smoothing: 
(1) take the SVD of X' = S^^XSy^ then (2) the FPCA factors are U = U' and 
V = V'. In other words, a half smoother is applied to the data, the SVD is taken and 
another half-smoother is applied to the SVD factors. This procedure is quite similar to the 
GMD solution outlined in Theorem[T] Let us assume that Q and R are smoothing matrices. 
Then, our GMD solution is essentially like half-smoothing the data, taking the SVD then 
inverse half-smoothing the SVD factors. Thus, unlike two-way FPCA, the GMD does not 
half-smooth both the data and the factors, but only the data and then finds factors that are 
in contrast to the smoothed data. If the converse is true and we take Q and R to be "inverse 
smoothers" such as a graph Laplacian, then this inverse smoother is applied to the data, the 
SVD is taken and the resulting factors are smoothed. Dissimilar to two-way FPCA which 
performs two smoothing operations, the GMD essentially applies one smoothing operation 
and one inverse smoothing operation. 

The GMD also shares similarities to other non-linear dimension reduction techniques 
such as Spectral Clustering, Local Linear Embedding, Manifold Learning, and Local Multi- 
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Dimensional Scaling. Spectral clustering, Laplacian embedding, and manifold learning 
all involve taking an eigendecomposition of a Laplacian matrix, L, capturing the dis- 
tances or relationships between variables. The eigenvectors corresponding to the smallest 
eigenvalues are of interest as these correspond to minimizing a weighted sums of squares: 
f-^Lf = X^iSi' Li^i'ifi — fi'Y, ensuring that the distances between neighboring variables 
in the reconstruction f are small. Suppose Q = L and R = I, then the GMD minimizes: 
Si Si' -^i,i'(Xi — Ui D V"^)^, similar to the Laplacian embedding criterion. Thus, the GMD 
with quadratic operators related to the structure or distance between variables ensures that 
the errors between the data and its low rank approximation are smallest for those variables 
that are close in the data structure. This concept is also closely related to Local Multi- 
Dimensional Scaling which weights MDS locally by placing more weight on variables that 
are in close proximity (Chen and Buja 2009). 

The close connections of the GMD to other recent non-linear dimension reduction tech- 
niques provides additional context for the role of the quadratic operators. Specifically, these 
examples indicate that for structured data, it is often not necessary to directly estimate 
the quadratic operators from the data. If the quadratic operators are taken as smoothing 
matrices, structural embedding matrices or Laplacians related to the distances between vari- 
ables, then the GMD has similar properties to many other non-linear unsupervised learning 
methods. In summary, when various quadratic operators that encode data structure such 
as the distances between variables are employed, the GMD can be interpreted as (1) finding 
the basis vectors of the covariance that are separate (and orthogonal) to that of the data 
structure, (2) finding factors that are smooth with respect to the data structure, or (3) 
finding an approximation to the data that has small reconstruction error with respect to the 
structural manifold. Through simulations in Section 4.1 we will explore the performance 
of the GMD for different combinations of graphical and smoothing operators encoding the 
known data structure. As there are many examples of structured high-dimensional data (e.g. 
imaging, neuroimaging, microscopy, hyperspectral imaging, chemometrics, remote sensing, 
environmental data, sensor data, and network data), these classes of quadratic operators 
provide a wide range of potential applications for our methodology. 



3 Generalized Penalized Matrix Factorization 



With high-dimensional data, many have advocated regularizing principal components by 
either automatically selecting relevant features as with Sparse PGA or by smoothing the 



factors as with Functional PGA ( 


Silverman 


1996 


Jolliffe et al. 


2003 


Zou et al. 


2006 


Shen 


and Huang 


2008 


Huang et al. 


2009 


Witten et al. 


2009 


Lee et al. 


2010 


). Regularized 



PGA can be important for massive structured data as well. Consider spatio-temporal fMRI 
data, for example, where many spatial locations or voxels in the brain are inactive and the 
time courses are often extremely noisy. Automatic feature selection of relevant voxels and 
smoothing of the time series in the context of PGA for structured data would thus be an 
important contribution. In this section, we seek a framework for regularizing the factors of 
the GMD by placing penalties on each factor. In developing this framework, we reveal an 
important result demonstrating the general class of penalties that can be placed on matrix 
factors that are to be estimated via deflation: the penalties must be norms or semi-norms. 



3.1 GPMF Problem 



Recently, many have suggested regularizing the factors of the SVD by forming two-way 
penalized regression problems ( Huang et al. 2009 Witten et al. 2009 Lee et al. 2010). We 



briefly review these existing approaches to understand how to frame a problem that allows 
us to place general sparse or smooth penalties on the GMD factors. 
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We compare the optimization problems of these three approaches for computing a single- 
factor two-way regularized matrix factorization: 



Witten et aT] (|2009| : maximize u^Xv subject to u^u < 1, v < 1, Pi(u) < ci, & P2{v) < C2. 



A 



Huang et al. (20091 : maximize X v --Pi(u)P2(v) 



Lee et al.|(|2010|) : maximize X v u"^ u v'^ v -4^Fi(u) - ^P2(v). 
' ' ' v,u 2 2 2 



Here, ci and C2 are constants, Pi() and P2{) are penalty functions, and A, Av and Au are 
penalty parameters. These optimization problems are attractive as they are bi-concave 
in u and v, meaning that they are concave in u with v fixed and vice versa. Thus, a 
simple maximization strategy of iteratively maximizing with respect to u and v results in a 
monotonic ascent algorithm converging to a local maximum. 

These methods, however, differ in the types of penalties that can be employed and their 



scaling of the matrix factors. Witten et al. ( 2009 1 explicitly restrict the norms of the factors 



while the constraints in the method of Huang et al. ( 2009 ) are implicit because of the types 



of functional data penalties employed. Thus, for these methods, as the constants ci and C2 or 
the penalty parameter, A approach zero, the SVD is returned. This is not the case, however. 



for problem in Lee et al. (2010) where the factors are not constrained in the optimization 



problem, although they are later scaled in their algorithm. Also, restricting the scale of 
the factors avoids possible numerical instabilities when employing the iterative estimation 
algorithm (see especially the supplementary materials of Huang et al. (2009)). Thus, we 



prefer the former two approaches for these reasons. In jWitten et al. (2009), however, only 
the lasso and fused lasso penalty functions are employed and it is unclear whether other 
penalties may be used in their framework. Huang et al. (20091, on the other hand, limit 



their consideration to quadratic functional data penalties, and their optimization problem 
does not implicitly scale the factors for other types of penalties. 

As we wish to employ general penalties, specifically sparse and smooth penalties, on 
the matrix factors, we adopt an optimization problem that leads to simple solutions for 
the factors with a wide class of penalties, as discussed in the subsequent section. We then, 
define the single-factor Generalized Penalized Matrix Factorization (GPMF) problem as the 
following: 

maximize u QXRv- AvPi(v) - AuP2(u) subject to u^Qu < 1 & v^Rv < 1, (4) 



where, as before, Av and Au are penalty parameters and Pi() and P2() are penalty functions. 
Note that if Au = Av = 0, then the left and right GMD factors can be found from (|4]), as 
desired. Strictly speaking, this problem is the Lagrangian of the problem introduced in 



Witten et al. (2009), keeping in mind that we should interpret the inner products as those 



induced by the Q,R-norm. As we will see in Theorem [2] in the next section, this problem 
is tractable for many common choices of penalty functions and avoids the scaling problems 
of other approaches. 



3.2 GPMF Solution 

We solve the GPMF criterion, via block coordinate ascent by alternating maximizing 
with respect to u then v. Note that if Av = or Au = 0, then the coordinate update for 
u or V is given by the GMD updates in Step (b) (i) of the GMD Algorithm. Consider the 
following result: 
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Theorem 2 Assume that Pi {) and P2{) are convex and homogeneous of order one: P{cx) = 
cP{x) V c > 0. Let u be fixed at u' or v be fixed at v' such that u'-^ Q X ^ or v'-^ R X"^ ^ 0. 
Then, the coordinate updates, u* and v* , maximizing the single-factor GPMF problem, Q, 



are given by the following: 
argmin„{i|| XRv' — u ||q - 



v||r 



Let V — argmin^{i| 
Au^2(u)}. Then, 



X' Qu' - V + AvPi(v)} and 




«/I|v||r > 
otherwise, 




«/I|u||q 

otherwise. 



> 



Theorem [2] states that for penalty functions that are convex and homogeneous of order 
one, the coordinate updates giving the single-factor GPMF solution can be obtained by a 
generalized penalized regression problem. Note that penalty functions that are norms or 
semi-norms are necessarily convex and homogeneous of order one. This class of functions 
includes common penalties such as the lasso (Tibshirani 1996"), group lasso (Yuan and 



Lin 2006), fused lasso (Tibshirani et al. 2005), and the generalized lasso ( |Tibshirani and 
Taylor 2011 1. Importantly, these do not include the ridge penalty, elastic net, concave 



penalties such as SCAD, and quadratic smoothing penalties commonly used in functional 
data analysis. Many of the penalized regression solutions for these penalties, however, can 
be approximated by penalties that are norms or semi-norms. For instance, to mimic the 
effect of a given quadratic penalty, one may use the square-root of this quadratic penalty. 



Similarly, the natural majorization-minimization algorithms for SCAD-type penalties (Fan 



and Li 2001a 



involve convex piecewise linear majorizations of the penalties that satisfy 
the conditions of Theorem [2] Thus, our single-factor GPMF problem both avoids the 
complicated scaling problems of some existing two-way regularized matrix factorizations, 
and still permits a wide class of penalties to be used within our framework. 

Following the structure of the GMD Algorithm, the multi-factor GPMF can be computed 
via the power method framework. That is, the GPMF algorithm replaces Steps (b) (i) of 
the GMD Algorithm with the single factor GPMF updates given in The orem [g} This 



follows the same approach as that of other two-way matrix factorizations (Huang et al 



2009 Witten et al. 2009 Lee et al. 2010). Notice that unhke the GMD, subsequent 



factors computed via this greedy deflation approach will not be orthogonal in the Q, R-norm 
to previously computed components. Many have noted in the Sparse PCA literature for 



example, that orthogonality of the sparse components is not necessarily warranted (Jolliffe 



et al. 


2003 


Zou et al. 


2006 


Shen and Huang 


2008 



orthogonality can be enforced in subsequent components via a Gram-Schmidt scheme ( Golub 



and Van Loan 1996 1 



In the following sections, we give methods for obtaining the single-factor GPMF updates 
for sparse or smooth penalty types, noting that many other penalties may also be employed 
with our methods. As the single-factor GPMF problem is symmetric in u and v, we solve 
the R-norm penalized regression problem in v, noting that the solutions are analogous for 
the Q-norm penalized problem in u. 



3.3 Sparsity: Lasso and Related Penalties 

With high-dimensional data, sparsity in the factors or principal components yields greater 
interpretability and, in some cases have better properties than un-penalized principal com- 
ponents (Jolliffe et al. 2003 Johnstone and Lu 2004). With neuroimaging data, sparsity in 
the factors associated with the voxels is particularly warranted as in most cases relatively 
few brain regions are expected to truly contribute to the signal. Hence, we consider our 
GPMF problem, with the lasso ( |Tibshirani[ |1996[ ), or £i-norm penalty, commonly used 
to encourage sparsity. 
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The penahzed regression problem given in Theorem [2] can be written as a lasso problem: 
i| I X"^ Q u — V 11^ + Av||v||i. If R = I, then the solution for v is obtained by simply 



- V IIr + Av II V 111. 

applying the soft thresholding operator: 



V = S{X' Qu,A), where S{x,X) = sign(a;)(|x| 



A)+ (Tibshirani 1996). When R 7^ I, the solution is not as simple: 



Claim 1 If Tl is diagonal with strictly positive diagonal entries, then v minimizing the 
R-norm lasso problem is given hy 'v — S'(X"^ Q u, Av R~^ ■'-(p))- 

Claim 2 The solution, v, that minimizes the R-norm lasso problem can be obtained by 
iterative coordinate-descent where the solution for each coordinate Vj is given by: Wj — 



(Rrj X^ Q u - Kj^^j v^j, A, 
sociated with column j of R. 



with the subscript R^j denoting the row elements as- 



Claim [T] states that when R is diagonal, the solution for v can be obtained by soft 
thresholding the elements of y by a vector penalty parameter. For general R, Claim [2] 
gives that we can use coordinate-descent to find the solution without having to compute 
R. Thus, the sparse single-factor GPMF solution can be calculated in a computationally 
efficient manner. One may further improve the speed of coordinate-descent by employing 



warm starts and iterating over active coordinates as described in Friedman et al. (2010). 

We have discussed the details of computing the GPMF factors for the lasso penalty, 
and similar techniques can be used to efRciently compute the solution for the group lasso 



(Yuan and Lin 2006), fused lasso (Tibshirani et al. 2005), generalized lasso (Tibshirani 



and Taylor 2011) and other sparse convex penalties. As mentioned, we limit our class of 



penalty functions to those that are norms or semi-norms, which does not include popular 
concave penalties such as the SCAD penalty (Fan and Li 2001b). As mentioned above. 



these penalties, however, can still be used in our framework as concave penalized regression 



problems can be solved via iterative weighted lasso problems (Mazumder et al. 2009 1. Thus, 
our GPMF framework can be used with a wide range of penalties to obtain sparsity in the 
factors. 

Finally, we note that as our GMD Algorithm can be used to perform GPCA, the sparse 
GPMF framework can be used to find sparse generalized principal components by setting 
Au = in (|4|. This is an approach well-studied in the Sparse PC A hterature (iShen and 



Huang 


2008 


Witten et al. 


2009 Lee et al. 


2010 



3.3.1 Smoothness: rj-norm Penalty &; Functional Data 

In addition to sparseness, there is much interest in penalties that encourage smoothness, 
especially in the context of functional data analysis. We show how the GPMF can be used 
with smooth penalties and propose a generalized gradient descent method to solve for these 
smooth GPMF facto rs. Many have proposed to obtain smoothness in the factors by using 
a quadratic penalty. Rice and Silverman ( 1991 1 suggested -P(v) = v, where is the 
matrix of squared second or fourth differences. As this penalty is not homogeneous of order 
one, we use the il-norm penalty: P{v) = (v-^Jlv)"^/^ = ||v||n. Since this penalty is a 
norm or a semi- norm, the GPMF solution given in Theorem [2|can be employed. We seek to 



|X^Qu^ 



V o. 



minimize the following J7-norm penalized regression problem: 
Notice that this problem is similar in structure to the group lasso problem of [Yuan and Lin 



(2006) with one group. 

To solve the il-norm penalized regression problem, we use a generalized gradient descent 
method. (We note that there are more elaborate first order solvers, introduced in recent 
works such as Becker et al. (2010, 2009), and we describe a simple version of such solvers). 

Suppose Jl has rank k. Then, set y = X Qu, and define B as B = ( I where 
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r2, and n~^^^ is taken to satisfy {n~^^^)'^ — Pfj, the Euclidean projection onto 

1/9 

the column space of n. The matrices and N can be found from the full eigenvalue 

decomposition of il, ft — T . Assuming A is in decreasing order, we can take ft~^^^ 

to be A~^(l : k,l : k){T{l : k, •))^ and N to be the last p — k rows of T. If 17 is taken to 
denote the squared differences, for example, then N can be taken as a row of ones. For the 
squared second differences, N can be set to have two rows: a row of ones and a row with 
the linear sequence 1, . . . ,p. 

1/2 

Having found Jl ' and N, we re-parametrize the problem by taking a non-degenerate 

— 1 / W\ 1/2 

linear transformation of v by setting B v = I 1 so that B w = v. Taking ' to be 

A(l : fc,l : fc)(r(l : k,-)Y, we note that ||v||ri = \\Q}''^w\\2 = || ^^^/^(fl"^/^ w +N?7)||2 = 
II w II 2, as desired. The f2-norm penalized regression problem, written in terms of (w,??) 
therefore has the form 

^||y-0-i/2w-N,?||^ + Av||w||2. (5) 

The solutions of ([s]) are in one-to-one correspondence to those of the r2-norm regression 
problem via the relation B w = v and hence, it is sufficient to solve ([5| . Consider the 
following algorithm and result: 



Algorithm 2 Algorithm for re-parametrized r2-norm penalized regression. 

1. Let L > be such that || B"^ RB ||op < L and initialize w^"). 

2. Define w^*) = w'*) +i B^ R (y - Q.-^'"^ -Nt?'*)) . 

3. Set 77(*+i) = (N^RN)t (n^ R(y - fi-^/^ w(*+i))) . 

4. Repeat Steps (b)-(c) until convergence. 



Proposition 3 The v* minimizing the fl-norm penalized regression problem is given by 



B 



where w* and rf are the solutions of Algorithm 



Since our problem employs a monotonic increasing function of the quadratic smoothing 
penalty, || v ||^, typically used for functional data analysis, then, the il-norm penalty also 
results in a smoothed factor, v. 

Computationally, our algorithm requires taking the matrix square root of the penalty 
matrix 12. For dense matrices, this is of orde r 0(p'^), but for com monly used difference 
matrices, sparsity can reduce the order to O(p^) ( Golub and Van Loan 1996 ). The matrix B, 
can be computed and stored prior to running algorithm and R does not need to be computed. 
Also, each step of Algorithm [2] is on the order of matrix multiplication. The generalized 
gradient descent steps for solving for w^*"*"^^ can also be solved by Euclidean projection onto 
{v e : v' 11 V < A}. Hence, as long as this projection is computationally feasible, the 
smooth GPMF penalty is computationally feasible for high-dimensional functional data. 

We have shown how one can use penalties to incorporate smoothness into the factors 
of our GPMF model. As with sparse GPCA, we can use this smooth penalty to perform 
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functional GPCA. The analog of our r2-norm penalized single-factor GPMF criterion with 
Au = is closely related to previously proposed methods for computing functional principal 
components ([Huang et al] |2008[ [200^. 



3.4 Selecting Penalty Parameters &z Variance Explained 



When applying the GPMF and Sparse or Functional GPCA to real structured data, there 
are two practical considerations that must be addressed: (1) the number of factors, K, to 
extract, and (2) the choice of penalty parameters, Au and Av, controlling the amount of 
regularization. Careful examination of the former is beyond the scope of this paper. For 



classical PGA, however, several methods exist for selecting the number of factors (Buja 



and Eyuboglu 1992 Troyanskaya et al. 2001 [Owen and Perry 2009). Extending the 



imputation-based methods of Troyanskaya et al. 1(2001 1 for GPCA methods is straightfor 



ward; we believe extensions of Buja and Eyuboglu (1992) and Owen and Perry (2009) are 



also possible in our framework. A related issue to selecting the number of factors is the 
amount of variance explained. While this is a simple calculation for GPCA (see Corollary 
[5|, this is not as straightforward for two-way regularized GPCA as the factors are no longer 
orthonormal in the Q, R-norm. Thus one must project out the effect of the previous factors 
to compute the cumulative variance explained by the first several factors. 



Proposition 4 LetXJk = [ui ... u^] and'Vk = [vi 



and P[^' = Vfc(V^ RVfc)-i yielding 



Vfc] and define Pfc^"* = U/c(U 



QXRP^^^ 



The 



the cumulative 



-u 



proportion of variance explained by the first k regularized GPCs is given by: tr(QXfcRX;;)/tr(QXRX'). 

Also note that since the deflation-based GPMF algorithm is greedy, the components are not 
necessarily ordered in terms of variance explained as those of the GMD. 

When applying the GPMF, the penalty parameters Au and Av control the amount of 
sparsity or smoothness in the estimated factors. We seek a data-driven way to estimate 
these penalty parameters. Many have proposed cross-validation approaches for the SVD 



(Troyanskaya et al. 2001 Owen and Perry 2009) or nested generalized cross-validation or 



Bayesian Information Criterion (BIG) approaches (Huang et al. 2009 Lee et al. 2010). 



While all of these methods are appropriate for our models as well, we propose an extension 



of the BIG selection method of Lee et al. (2010) appropriate for the Q, R-norm 



Claim 3 The BIG selection criterion for the GPMF factor v with u and d fixed at u' and 
d' respectively is given by the following: 



BIC{K) = log 



np 



\og{np) 



np 



df{K). 



The BIG criterion for the other factor u is analogous. Here, df{\^) is an estimate of the 
degrees of freedo m for the particu lar penalty employed. For the lasso penalty, for example, 
df{\v) —\ {v} I (Zou et al. 2007). Expressions for the degrees of freedom of other penalty 



functions are given in Kato (20091 and Tibshirani and Taylor (2011). 



Given this model selection criterion, one can select the optimal penalty parameter at each 



iteration of the factors for the GPMF algorithm as in Lee et al. ( 2010 ), or use the criterion to 



select parameters after the iterative algorithm has converged. In our experiments, we found 
that both of these approaches perform similarly, but the latter is more numerically stable. 
The performance of this method is investigated through simulation studies in Section |4.1[ 
Finally, we note that since the the GPMF only converges to a local optimum, the solution 
achieved is highly dependent on the initial starting point. Thus, we use the GMD solution 
to initialize all our algorithms, an approach similar to related methods which only achieve 



a local optimum ( |Witten et al.[ |2009[ |Lee et al.[ |2010[ ) 
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Table 1: Comparison of different quadratic operators for GPCA in the spatio-temporal sim- 



ulation. 





/o var K — i 


/o var /c — z 


iVlooil/ Ui 








Q 


= I(m2), R 


- \p) 


57.4 (0.9) 


19.1 (0.9) 


0.5292 (.04) 


0.6478 (.02) 


0.3923 (.02) 


0.4857 (.01) 


Q 


= S \ R = 


A-i 


75.4 (2.3) 


6.6 (0.2) 


0.1452 (.02) 


0.3226 (.02) 


0.0087 (.01) 


0.0180 (.01) 


Q 




= Lp 


42.9 (2.9) 


2.8 (0.1) 


0.1981 (.03) 


0.7972 (.02) 


0.0609 (.02) 


0.4334 (.02) 


Q 


— ^m,m-> 


= Sp 


55.8 (2.3) 


14.5 (0.3) 


0.1714 (.02) 


0.3425 (.03) 


0.0481 (.01) 


0.0809 (.02) 


Q 


^Tn.rn i 




67.6 (0.7) 


13.0 (0.8) 


0.8320 (.02) 


0.8004 (.01) 


0.5414 (.01) 


0.4831 (.01) 


Q 


^m.ra i 


— Sp 


60.3 (1.0) 


19.3 (1.0) 


0.5682 (.04) 


0.6779 (.02) 


0.4310 (.02) 


0.5030 (.01) 



4 Results 

We assess the effectiveness of our methods on simulated data sets and a real functional MRI 
example. 



4.1 Simulations 



All simulations are generated from the following model: X = S + E 

where the (/)fc's denote the magnitude of the rank-/-C signal matrix S, Zjj 

A^„,p(0,S,A). 



S and A are the row and column covariances so that E 



% Ufe - 

iid 



_ 5,1/2 2^1/2 



N{0, 1) and 
Thus, the data 



is simulated according to the matrix-variate normal distribution with mean given by the 
rank- if signal matrix, S. The SNR for this model is given by E |tr(S^ S) /E tr(E^ E) = 

E (^^u^ UfeV^ Vfe /Eftr(S)tr(A)l dGupta and Nagarl 1999D. The data is row and 



/E [tr(S)tr(A)] (Gupta and Nagar 
column centered before each method is applied, and to be consistent, the BIG criterion is 
used to select parameters for all methods. 



4.1.1 Spatio- Temporal Simulation 

Our first set of simulations is inspired by spatio-temporal neuroimaging data, and is that 
of the example shown in Figure [l] The two spatial signals, U G each consist of 

three non-overlapping "regions of interest" on a 16 x 16 grid. The two temporal signals, 
V G 5R^°°^^, are sinusoidal activation patterns with different frequencies. The signal is given 
by ~ N{l,a'^) and (/)2 ~ A^(0.5,ct^) where a is chosen so that the SNR — a^. The noise 
is simulated from an autoregressive Gaussian Markov random field. The spatial covariance, 
S, is that of an autoregressive process on the 16 x 16 grid with correlation 0.9, and the 
temporal covariance. A, is that of an autoregressive temporal process with correlation 0.8. 



The behavior demonstrated by Sparse PGA (implemented using the approach of (Shen 
and Huang 2008 )) and Sparse GPGA in Figure lllwhere cr^ = 1 is typical. Namely, when the 
signal is small and the structural noise is strong, PGA and Sparse PGA often find structural 
noise as the major source of variation instead of the true signal, the regions of interest and 
activation patterns. Even in subsequent components, PGA and Sparse PGA often find a 
mixture of signal and structural noise, so that the signal is never easily identified. In this 
example, Q and R are not estimated from the data and instead fixed based on the known 
data structure, a 16 x 16 grid and 200 equally spaced points respectively. In Table [l] we 
explore two simple possibilities for quadratic operators for this spatio-temporal data: graph 
Laplacians of a graph connecting nearest neighbors and a kernel smoothing matrix (using 
the Epanechnikov kernel) smoothing nearest neighbors. Notationally, these are denoted 
as Lm,m for a Laplacian on a m x m mesh grid or Lp for p equally spaced variables; S 
is denoted analogously. We present results when cr^ = 1 in terms of variance explained 
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TmeSfstiall 3PCA Spatial 1 SGPCA St^tial 1 






Figure 1: Example results from the spatio-temporal simulation. 



and mean squared standardized error (MSSE) for 100 replicates of the simulation. Notice 
that the combination of a spatial Laplacian and a temporal smoother in terms of signal 
recovery performs nearly as well as when Q and R are set to the true population values 
and significantly better than classical PCA. Thus, quadratic operators do not necessarily 
need to be estimated from the data, but can instead be fixed based upon known structure. 
Returning to the example in Figure [l] Q and R are taken as a Laplacian and smoother 
respectively, yielding excellent recovery of the regions of interest and activation patterns. 

Next, we investigate the feature selection properties of Sparse GPCA as compared to 
Sparse PCA for the structured spatio-temporal simulation. Note that given the above 
results, Q and R are taken as a Laplacian and a smoothing operator for nearest neighbor 
graphs respectively. In Figure [2j we present mean receiver-operator curves (ROC) averaged 
over 100 replicates achieved by varying the regularization parameter, A. In Table [2] we 
present feature selection results in terms of true and false positives when the regularization 
parameter A is fixed and estimated via the BIC method. From both of these results, we see 
that Sparse GPCA has major advantages over Sparse PCA. Notice also that accounting for 
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Figure 2: Average receiver-operator curves for the spatio-temporal simulation. 



Table 2: Feature selection results for the spatio-temporal simulation. 





True Positive Ui 


False Positive Ui 


True Positive U2 


False Positive U2 


„ Sparse PCA 
^ Sparse GPCA 


0.9859 (.008) 
0.9045 (.025) 


0.9758 (.014) 
0.0537 (.007) 


0.7083 (.035) 
0.7892 (.032) 


0.7391 (.029) 
0.1749 (.005) 


Sparse PCA 
^ ~ Sparse GPCA 


0.9927 (.004) 
0.9541 (.018) 


0.7144 (.039) 
0.0292 (.005) 


0.7592 (.034) 
0.9204 (.024) 


0.7852 (.030) 
0.1572 (.004) 



the spatio-temporal structure through Q and R also yields better model selection via BIC 
in Table [2] 

Overall, this spatio-temporal simulation has demonstrated significant advantages of 
GPCA and Sparse GPCA for data with low signal compared to the structural noise when 
the structure is fixed and known. 



4.1.2 Sparse and Functional Simulations 

The previous simulation example used the outer product of a sparse and smooth signal with 
autoregressive noise. Here, we investigate the behavior of our methods when both of the 
factors are either smooth or sparse and when the noise arises from differing graph structures. 
Specifically, we simulate noise with either a block diagonal covariance or a random graph 
structure in which the precision matrix arises from a graph with a random number of edges. 
The former has blocks of size five with off diagonal elements equal to 0.99. The latter is 
generated from a graph where each vertex is connected to nj randomly selected vertices, 

where Uj Poisson{3) for the row variables and Poisson{l) for the column variables. The 
row and column covariances, S and A are taken as the inverse of the nearest diagonally 
dominant matrix to the graph Laplacian of the row and column variables respectively. For 
our GMD and GPMF methods, the values of the true quadratic operators are assumed to 
be unknown, but the structure is known, and thus Q and R are taken to be the graph 
Laplacian. One hundred replicates of the simulations for data of dimension 100 x 100 were 
performed. 

To test our GPMF method with the smooth O-norm penalties, we simulate sinusoidal 
row and column factors: u = sm(47rx) and v = —sin{2T:x) for 100 equally spaced values, 
X G [0,1]. As the factors are fixed, the rank one signal is multiplied by (/) ~ Af(0, c^cr^), 
where c is chosen such that the SNR = . In Table 3, we compare the GMD and GPMF 



with smooth penalties to the SVD and the two-way functional PCA of Huang et al.| (2009) 
in terms of squared prediction errors of the row and column factors and rank one signal. 
We see that both the GMD and functional GPMF outperform competing methods. Notice, 
however, that the functional GPMF does not always perform as well as the un-penalized 
GMD. In many cases, as the quadratic operators act as smoothing matrices of the noise, 
the GMD yields fairly smooth estimates of the factors. Figure [3] 
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Table 3: Functional Data Simulation Results. 





Squared Error Squared Error Squared Error 
Row Factor Column Factor Rank 1 Matrix 


a = 0.5 

Block Diagonal Covariance 




SVD 

Two- Way Functional PCA 
GMD 

Functional GPMF 


3.670 (.121) 3.615 (.117) 67.422 (3.29) 
3.431 (.134) 3.315 (.127) 65.973 (3.41) 
1.779 (.138) 2.497 (.138) 60.687 (3.55) 
1.721 (.133) 1.721 (.144) 56.296 (2.81) 


Random Graph 




SVD 

Two- Way Functional PCA 
GMD 

Functional GPMF 


2.827 (.241) 2.729 (.234) 49.674 (1.50) 
1.968 (.258) 1.878 (.250) 49.694 (1.50) 
0.472 (.095) 1.310 (.138) 49.654 (1.51) 
0.663 (.049) 0.388 (.031) 49.666 (1.51) 


cr = 1.0 

Block Diagonal Covariance 




SVD 

Two- Way Functional PCA 
GMD 

Functional GPMF 


2.654 (.138) 2.532 (.143) 94.412 (6.49) 
2.426 (.142) 2.311 (.144) 92.522 (6.59) 
1.094 (.117) 1.647 (.120) 88.852 (6.73) 
1.671 (.110) 2.105 (.129) 76.864 (5.27) 


Random Graph 




SVD 

Two- Way Functional PCA 
GMD 

Functional GPMF 


2.075 (.224) 1.961 (.226) 50.392 (2.61) 
1.224 (.212) 1.187 (.206) 50.276 (2.62) 
0.338 (.075) 0.926 (.119) 50.258 (2.62) 
0.608 (.070) 0.659 (.245) 50.345 (2.60) 
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Table 4: Sparse Factors Simulation Results. 





% True Positives 
Row Factor 


% False Positives 
Row Factor 


% True Positives 
Column Factor 


% False Positives 
Column Factor 


Block Diagonal 
Covariance 




cr = 0.5 
oparse r^ivijj 
Sparse GPMF 


76.24 (1.86) 
79.72 (2.45) 


46.20 (3.16) 
1.16 (0.15) 


79.68 (1.55) 
82.60 (2.57) 


53.23 (3.16) 
1.16 (0.17) 


a = 1.0 

Snarqp PMD 

Sparse GPMF 


87.80 (1.26) 
87.12 (2.19) 


32.80 (3.34) 
1.25 (0.17) 


88.56 (1.05) 
87.24 (2.21) 


40.93 (3.29) 
1.48 (0.18) 


Random Graph 




cr = 0.5 
Sparse PMD 
Sparse GPMF 


83.56 (1.86) 
87.36 (2.45) 


39.01 (3.16) 
28.29 (0.15) 


79.44 (1.55) 
81.16 (2.53) 


33.67 (3.16) 
24.73 (0.17) 


cr = 1.0 
Sparse PMD 
Sparse GPMF 


88.04 (1.26) 
92.48 (2.19) 


46.12 (3.34) 
44.28 (0.17) 


85.36 (1.05) 
87.80 (2.21) 


48.63 (3.29) 
41.20 (0.18) 



Finally, we test our sparse GPMF method against other sparse penalized matrix decom- 



positions (Witten et al. 2009 Lee et al. 20101, in Table 4. In both the row and column 
factors, one fourth of the elements are non-zero and simulated according to iV(0,cr^). Here 
the scaling factor (p is chosen so that the SNR — <t^. We compare the methods in terms 
of the average percent true and false positives for the row and column factors. The results 
indicate that our methods perform well, especially when the noise has a block diagonal 
covariance structure. 

The three simulations indicate that our GPCA and sparse and functional GPCA methods 
perform well when there are two-way dependencies of the data with known structure. For 
the tasks of identifying regions of interest, functional patterns, and feature selection with 
transposable data, our methods show a substantial improvement over existing technologies. 



4.2 Functional MRI Example 

We demonstrate the effectiveness of our GPCA and Sparse GPCA methods on a functional 
MRI example. As discussed in the motivation for our methods, functional MRI data is a 
classic example of two-way structured data in which the nature of the noise with respect 
to this structure is relatively well understood. Noise in the spatial domain is related to 
the distance between voxels while noise in the temporal domain is often assumed to follow 
a autoregressive process or another classical time series process (Lindquist 20081. Thus, 



when fitting our methods to this data, we consider fixed quadratic operators related to 
these structures and select the pair of quadratic operators yielding the largest proportion 
of variance explained by the first GPC. Specifically, we consider Q in the spatial domain 
to be a graph Laplacian of a nearest neighbor graph connecting the voxels or a positive 
power of this Laplacian. In the temporal domain, we consider R as a graph Laplacian or a 
positive power of a Laplacian of a chain graph or a one-dimensional smoothing matrix with 
a window size of five or ten. In this manner, Q and R are not estimated from the data and 
are fixed a priori based on the known two-way structure of fMRI data. 

For our functional MRI example, we use a well-studied, publicly available fMRI data 
set where subjects were shown images and read audible sentences related to these images. 
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5 10 15 20 25 

Component 

Figure 4: Cumulative proportion of variance explained by the first 25 PCs on the starplus fMRI data. 
Generalized PCA (GPCA) and Sparse GPCA (SGPCA) yield a significantly larger reduction in dimension 
that PCA and Sparse PGA (SPG A). 



a data set which refer to as the "StarPlus" data (Mitchell et al. 2004). We study data 



for one subject, subject number 04847, which consists of 4,698 voxels (64 x 64 x 8 images 
with masking) measured for 20 tasks, each lasting for 27 seconds, yielding 54 - 55 time 
points. The data was pre-processed by standardizing each voxel within each task segment. 
Rearranging the data yields a 4,698 x 1,098 matrix to which we applied our dimension 
reduction techniques. For this data, Q was selected to be an unweighted Laplacian and R 
a kernel smoother with a window size of ten time points. In Figure [5] we present the first 
three PCs found by PCA, Sparse PCA, GPCA, and Sparse GPCA. Both the spatial PCs, 
illustrated by the eight axial slices, and the corresponding temporal PCs, with dotted red 
vertical lines denoting the onset of each new task period, are given. Spatial noise overwhelms 
PCA and Sparse PCA with large regions selected and with the corresponding time series 
appearing to be artifacts or scanner noise, unrelated to the experimental task. The time 
series of the first three GPCs and Sparse GPCs, however, exhibit a clear correlation with the 
task period and temporal structure characteristic of the BOLD signal. While the spatial 
PCs of GPCA are a bit noisy, incorporating sparsity with Sparse GPCA yields spatial 
maps consistent with the task in which the subject was viewing and identifying images then 
hearing a sentence related to that image. In particular, the first two SGPCs show bilateral 
occipital, left-lateralized inferior temporal, and inferior frontal activity characteristic of the 



well-known "ventral stream" pathway recruited during object identification tasks ( [Pennick 



and Kana 2012 1 



In Figure |4.2[ we show the cumulative proportion of variance explained by the first 25 
PCs for all our methods. Generalized PCA methods clearly exhibit an enormous advantage 
in terms of dimension reduction with the first GPC and SGPC explaining more sample 
variance than the first 20 classical PCs. As PCA is commonly used as an initial dimension 
reduction technique for other pattern recognition techniques, such as independent compo- 
nents analysis, applied to fMRI data (Lindquist, 2008), our Generalized PCA methods can 
offer a major advantage in this context. Overall, by directly accounting for the known 
two-way structure of fMRI data, our GPCA methods yield biologically and experimentally 
relevant results that explain much more variance than classical PCA methods. 
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5 Discussion 



In this paper, we have presented a general framework for incorporating structure into a ma- 
trix decomposition and a two-way penalized matrix factorization. Hence, we have provided 
an alternative to the SVD, PCA and regularized PCA that is appropriate for structured 
data. We have also developed fast computational algorithms to apply our methods to mas- 
sive data such as that of functional MRIs. Along the way, we have provided results that 
clarify the types of penalty functions that may be employed on matrix factors that are 
estimated via a deflation scheme. Through simulations and a real example on fMRI data, 
we have shown that GPCA and regularized GPCA can be a powerful method for signal 
recovery and dimension reduction of high-dimensional structured data. 

While we have presented a general framework permitting heteroscedastic errors in PCA 
and two-way regularized PCA, we currently only advocate the applied use of our method- 
ology for structured data. Data in which measurements are taken on a grid (e.g. regularly 
spaced time series, spectroscopy, and image data) or on known Euclidean coordinates (e.g. 
environmental data, epigenetic methylation data and unequally spaced longitudinal data) 
have known structure which can be encoded by the quadratic operators through smooth- 
ing or graphical operators as discussed in Section |2.5| Through close connections to other 
non-linear unsupervised learning techniques and interpretations of GPCA as a covariance 
decomposition and smoothing operator, the behavior our methodology for structured data 
is well understood. For data that is not measured on a structured surface, however, more 
investigations need to be done to determine the applicability of our methods. In particular, 
while it may be tempting to estimate both the quadratic operators, via |Allen and TibshF] 



rani 



(2010) for example, and the GPCA factors, this is not an approach we advocate as 
separation of the noise and signal covariance structure may be confounded. 

For specific applications to structured data, however, there is much future work to be 
done to determine how to best employ our methodology. The utility of GPCA would be 
greatly enhanced by data-driven methods to learn or estimate the optimal quadratic opera- 
tors from certain classes of structured matrices or an inferential approach to determining the 
best pair of quadratic operators from a fixed set of options. With structured data in partic- 
ular, methods to achieve these are not straightforward as the amount of variance explained 
or the matrix reconstruction error is not always a good measure of how well the structural 
noise is separated from the actual signal of interest (as seen in the spatio-temporal simula- 
tion. Section 4.1 1. Additionally as the definition of the "signal" varies from one application 
to another, these issues should be studied carefully for specific applications. An area of 
future research would then be to develop the theoretical properties of GPCA in terms of 
consistency or information-theoretic bounds based on classes of signal and structured noise. 
While these investigations are beyond the scope of this initial paper, the authors plan on 
exploring these issues as well as applications to specific data sets in future work. 

There are many other statistical directions for future research related to our work. As 
many other classical multivariate analysis methods are closely related to PCA, our frame- 
work for incorporating structure and two-way regularization could be employed in methods 
such as canonical correlations analysis, discriminant analysis, multi-dimensional scaling, la- 
tent variable models, and clustering. Also several other statistical and machine learning 
techniques are based on Frobenius norms which may be altered for structured data along 
the lines of our approach. Additionally, there are many areas of research related to two-way 
regularized PCA models. Theorem [2] elucidates the classes of penalties that can be em- 
ployed on PCA factors estimated via deflation that ensure algorithmic convergence. This 
convergence, however, is only to a local optimum. Methods are then needed to flnd good 
initializations, to estimate optimal penalty parameters, to find the relevant range of penalty 
parameters comprising a full solution path, and to ensure convergence to a good solution. 
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Furthermore, asymptotic consistency of several approaches to Sparse PCA has been estab- 
hshed (Johnstone and Lu 2004 Amini and Wainwright 2009 Ma 2010), but more work 



needs to be done to estabhsh consistency of two-way Sparse PCA. 

Finally, our methodology work has broad implications in a wide array of applied disci- 
plines. Massive image data is common in areas of neuroimaging, microscopy, hyperspectral 
imaging, remote sensing, and radiology. Other examples of high-dimensional structured 
data can be found in environmental and climate studies, times series and finance, engineer- 
ing sensor and network data, and astronomy. Our GPCA and regularized GPCA methods 
can be used with these structured data sets for improved dimension reduction, signal recov- 
ery, feature selection, data visualization and exploratory data analysis. 

In conclusion, we have presented a general mathematical framework for PCA and regu- 
larized PCA for massive structured data. Our methods have broad statistical and applied 
implications that will lead to many future areas of research. 
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A Proofs 

Proof 1 (Proof of Theorem |l]) We show that the GMD problem, atJJ* ,13* ,V* is equivalent 
to the SVD problem, jlj, for X at U,D,V. Thus, if tj, D,V minimizes the SVD problem, then 
U*,D*, V* minimizes the GMD problem, ([2jl. We begin with the objectives: 

||X-U* D*(V*)'^||^ = tr (qXRX"^) - 2tr U* D*(V*)^RX^) 

-I- tr (q U* D*(V*)^ R V*(D*)^(U*)^) 
= tr (qQ^ XRR^ X^) - 2tr (qQ^Q"^UDv'^(R"^)^RR^ X^ q) 

-I- tr (qQ^Q"^UDv'^(R~^)^RR^R"^VD^U^Q"^) 
= tr (XX^) - 2tr (uDV^X^) + tr (uDV^VD^U^) = ||X - UDV^|||,. 

One can easily verify that the constraint regions are equivalent: {U*)-^QU* = TJ'^Q ^ Q(Q ^)'^U = 

~ T ~ 

U U and similarly for V and R. This completes the proof. 

Proof 2 (Proof of Corollaries to Theorem [T]) These results follow from the properties of the 
SVD \Eckart and Young\\1936\\Horn and Johnson\\l985\\Golub and Van Loan\\1996f^ and the relationship 
between the GMD and SVD given in Theorem [7] For Corollary [2| we use the fact that rank(AB) < 
min{rank(A), rank(B)} for any two matrices A and B; equality holds if range{A) r\null(B) = 0. Then, 
rank(X) = k = min{rank(X), rank{Q), rank(R)} . 

Proof 3 (Proof of Proposition [T]) We show that the updates ofu and V in the GMD Algorithm 
are equivalent to the updates in the power method for computing the SVD ofK. Writing the update for Ufc 
in terms ofX., u, and v (suppressing the index k), we have: 

(Q-i.T-_ ((Q''rXR-')R(R-^rv ^. ^ Xv Xv 

||((Q"'rXR-^)R(R-'rv||q' ^^^X^Q"^ Q(Q-^)^Xv H^^H^' 
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Notice that this last form in u is that of the power method for computing the SVD of X Golub and 
\Van Loari\\199^ . A similar calculation for v yields an analogous form. Therefore, the GMD Algorithm is 
equivalent to the algorithm which converges to the SVD ofX.. 

Proof 4 (Proof of Proposition [2]) We show that the objective and constraints in ^ at the GMD 

solution are the same as that of the rank k PGA problem for X. (For simplicity of notation, we suppress 
the index k in the calculation.) 

RX^ QXRv = v^(R"^)^ RR"^X(Q"^)^ Q Q^X(R^)^ RR^^v = v^X^Xv 
Rv = v'^(R~^)^ RR^^v = v^v. 

Then, the left singular vector, v, that maximizes the PGA problem, is the same as the left GMD factor that 
maximizes (jsji. 

Proof 5 (Proof of Corollary [s]) This result follows from the relationship of the GMD and GPGA 
to the SVD and PGA. Recall that for PGA, the amount of variance explained by the fc*'^ PG is given by 
d^/||X|||, where dk is the fc*'' singular value ofJQ. Then, notice that the stated result is equivalent to the 
proportion of variance explained by the fc*'' singular vector of:>L: dl/tr{X.'K ) = d2/tr(Q X RX^). 

Proof 6 (Proof of Theorem l2h We will show that the result holds for V, and the argument for u 
is analogous. Gonsider optimization ofm with respect to v with u fixed at u'. The problem is concave in v 
and there exists a strictly feasible point, hence Slater's condition holds and the KKT conditions are necessary 
and sufficient for optimality. Letting y = X"^ Q u, these are given by: Ry — Av VPi(v*) — 27* R v* = 
and 7*((v*)-^ Rv* — 1) = 0. Here, VPi(v*) is a subgradient of Pi{) with respect to v*. Now, consider 
the following solution to the penalized regression problem: v = argmin-^,{||| y — v ||^ + Avfi(v)}. This 
solution must satisfy the subgradient conditions. Hence, V c > 0, we have: 

= Ry- Av VPi(v) - Rv = Ry- Av VPi(v) - R(cv)i = Ry- Av VPi(v) - Rv^, 

c c 

where v = cv. The last step follows because since Pi{) is convex and homogeneous of order one, VPi(x) = 
VPi(cx) V c> 0. 

Let us take c = 1/||v||r; we see that this satisfies c > 0. Then, letting 7* = ^ = I|v||r/2, we see 
that the subgradient condition of ijlj is equivalent to that of the penalized regression problem. Putting these 
together, we see that the pair (v* = v/||v||ii, 7* = ||v||pi/2) satisfies the KKT conditions and is hence the 
optimal solution. Notice that 7* = only if ||v||r = 0. FVom our discussion of the quadratic norm, recall 
that this can only occur if v a null{K) or v = 0. Since we exclude the former by assumption, this implies 
that if V = 0, then 7* = and the inequality constraint in ^ is not tight. In this case, it is easy to verify 
that the pair (v* =0,7* = 0) satisfy the KKT conditions. Thus, we have proven the desired result. 

Proof 7 (Proof of Claim [l]) Examining the subgradient condition, we have that v = y— AR "'^l(p)VP( 
This can be written as such because the subgradient equation is separable in the elements of v when R is 
diagonal. Then, it follows that soft thresholding the elements of y with the penalty vector A R"-"- l^^j solves 
the subgradient equation. 

Proof 8 (Proof of Claim [2]) We can write the problem, ^ || y - v | In + A v 1 1 v | |i, as a lasso re- 
gression problem in terms of R, |||R y— R v||| + Av||v||i. Friedman et al. 2001) established that 
the lasso regression problem can be solved by iterative coordinate- descent. Then, optimizing with respect to 
each coordinate, Vj: Vj = _ ^ ^- — 5 ("R^^Ry — Rj?"jRr,7^jV:^j, A j . Now, several of these quantities can be 

~ T ~ ~ T ~ 

written in terms 0/ R so that R does not need to be computed and stored: R^^R^j = Rjj, R^^R = Rrj, 
~ T ~ 

and R,,jR,.,^j = Rj,:^j. Putting these together, we have the desired result. 

Proof 9 (Proof of Proposition [S]) We will show that Algorithm [i| uses a generalized gradi- 
ent descent method converging to the minimum of the Si-norm generalized regression problem. First, 
since B is full rank, there is a one to one mapping between v and w. Thus, any (w*,?;*) minimizing 

/(w,?;) = ^11 y — fl"-'''^ w — N»7|||j^ + Av II w II2 yields v* = 'B ^^^^ which minimizes the Si-norm gener- 
alized regression problem. 
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Now, let us call the first and second terms in f{wj,ri), g{'w,r}) and h{w,r]) respectively. Then, h{w,ri) 
and g{v/,r]) are obviously convex and g{w,T]) is Lipschitz with a Lipschitz constant upper bounded by 

II RB Hop. To see this, note that Vg(w, rf) = — B^ R — B ^^^^ linear in (w, ri). Therefore, we 

can take the operator norm of the linear part as a Lipschitz constant. We also note that /(wrj) is separable 
in w and rj and hence block-wise coordinate descent can be employed. 

Putting all of these together, the generalized gradient update for w and the update for rj t hat converge to 
the minimum of the ii-norm penalized regression problem is given by the following updates \Vandenberghe\ 
\200^\Beck and Tebou[l^\2009f . 



L,(t+i) = argmin ^ — 



+i B^ R (y - wW -N,,W' 



2 

+ Av| 

2 



ri'^^+'-) = argmin /(w^'+-^J,»7). 

V 

Notice that the first minimization problem is of the form of the group lasso regression problem with an 
identity predictor matrix. The solution to this is problem given in | Yuan and Li/r\ \200(j^ and is the update 
in Step (b) of Algorithm^ The second minimization problem is a simple linear regression and is the update 
in Step (c). This completes the proof. 



Proof 10 (Proof of Proposition |4]) This result follows from an argument in Shen and Huang 
i fgQQgp . Namely, since our regularized GPCA formulation does not impose orthogonality of subsequent 
factors in the Q, H-norm, the effect of the first k factors must be projected out of the data to com- 
pute the cumulative proportion of variance explained. \Shen and Huan^ JgQQgp showed that for Sparse 
PGA with a penalty on v, the total variance explained is given by tr(Xj. X^) where Xj; = XP^^' = 
X Vfe ( V J V J. ) ~ with'V)^ = [vi ... vj.]. When penalties are incorporated on both factors, one must 
define Xj. = P^'^' XP^^'. We show that the total proportion of variance explained (the numerator of our 
stated result) is equivalent to tr(Xj.Xfe), where Xt; is equivalent to that as defined in Theorem Q] First, 
notice that P^"^ = Q"^Ufc(U^(Q"^)^ Q Q"^Ufc)-iUfe (Q~^)^ = P^"' (Q"^)^ and equivalently, 

W^^^ = R"^ P<^^'{R"^)^. Then, Xfc = P*;^"' Q X R P^^' = Q"^ P^"' XP^"^' R"\ and we have that 
tr(QXfcRX^) = QQ^Q^^PJ."' XP[^^> R~^RR^(R"^)T P<^^' X^P^"\q"^)T = tr(XfcXfc), thus giv- 
ing the desired result. 



Proof 11 (Proof of Claim ISb Lee et al. '2010) show that estimating v in the manner described 



in our GPMF framework is analogous to a penalized regression problem with a sample size of np and p 
variables. Using this and assuming that the noise term of the penalized regression problem is unknown, 
we arrive at the given form of the BIG. Notice that the sums of squares loss function is replaced by the 
Q, R-norm loss function of the GMD model. 
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